python4science

Lorenzo Drufuca

13/10/2020

Lecture Overview

Scientific Python Ecosystem

NumPy

What is numpy

Reference Manual

NumPy stands for Numerical Python and is the universal standard for working with numerical data in Python.

  • It is a fundamental building block of many other packages (pandas, sklearn…).
  • Its main class is numpy.ndarray
  • It provides functions to operate on arrays and array elements

Arrays

In NumPy an array is a collection of elements all of the same type and size; therefore, an instance of class ndarray consists of a contiguous one-dimensional segment of computer memory.

Arrays can have multiple dimensions. Array dimensions are called axes

The shape of an array is a tuple of integers giving the size of the array along each dimension

Elements within an array are indexed by tuples of non-negative integers

Arrays vs Lists

  • Arrays use less memory:
    • NumPy arrays have a single dtype (data type) for all elements, so content is stored as a compact byte stream with a small header describing it
    • Each list element is a full Python object and can have a different type
  • Arrays are faster:
    • NumPy functions (np.sum, np.linalg.inv, np.fft.fft) are implemented in C/C++ (BLAS, LAPACK, MKL, …)
    • Looping over a Python list always incurs interpreter overhead
  • Arrays are easier to use for numeric problems:
    • NumPy supports matrix operations (np.matmul, np.einsum)

NB there exists a standard Python library called array, but it only provides one-dimensional arrays with limited functionality
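As a rough illustration of the memory claim above, we can compare the bytes used by a list of Python ints with the bytes used by the equivalent array (a sketch; exact numbers depend on the Python build):

```python
import sys
import numpy as np

n = 1000
lst = list(range(n))
arr = np.arange(n)  # dtype int64 on most 64-bit platforms

# A list stores an array of pointers plus one full Python int object per element
list_bytes = sys.getsizeof(lst) + sum(sys.getsizeof(i) for i in lst)

# An array stores a small header plus a packed buffer of fixed-size items
array_bytes = arr.nbytes  # n * arr.itemsize

print(list_bytes, array_bytes)  # the list uses several times more memory
```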

Array Attributes

Memory layout

  • ndarray.flags Information about the memory layout of the array.
  • ndarray.shape Tuple of array dimensions.
  • ndarray.strides Tuple of bytes to step in each dimension when traversing an array.
  • ndarray.ndim Number of array dimensions.
  • ndarray.data Python buffer object pointing to the start of the array’s data.
  • ndarray.size Number of elements in the array.
  • ndarray.itemsize Length of one array element in bytes.
  • ndarray.nbytes Total bytes consumed by the elements of the array.
  • ndarray.base Base object if memory is from some other object.

Data Type

  • ndarray.dtype Data-type of the array’s elements

Other Attributes

  • ndarray.T The transposed array
  • ndarray.flat A 1-D iterator over the array
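These attributes can be inspected directly; a quick sketch:

```python
import numpy as np

a = np.arange(12, dtype=np.int64).reshape(3, 4)

print(a.shape)     # (3, 4)
print(a.ndim)      # 2
print(a.size)      # 12
print(a.dtype)     # int64
print(a.itemsize)  # 8 bytes per int64 element
print(a.nbytes)    # 96 = size * itemsize
print(a.strides)   # (32, 8): bytes to step along each axis
print(a.T.shape)   # (4, 3), the transposed view
print(list(a.flat)[:3])  # [0, 1, 2], flat iterates over all elements
```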

Creating an array

Arrays can be generated from sequences of data (lists/tuples)

import numpy as np

# from a sequence of data
np.array([1,2,3,4]) # [1,2,3,4]
np.array([(0,1,2),(1,2,3)]) # [[0,1,2],[1,2,3]]

Arrays can be generated by specific functions

arange, linspace, zeros, zeros_like, ones, ones_like, empty, empty_like…

import numpy as np

a = np.zeros(shape=(2,3)) # [[0,0,0],[0,0,0]]

np.ones_like(a) # [[1,1,1],[1,1,1]]

np.arange(start=1, stop=5) # [1,2,3,4]

np.linspace(start=0,stop=1,num=5) # [0,0.25,0.5,0.75,1]

# Read array from file
b = np.fromfile('path/to/file', sep='\t', dtype=int)

NB numpy.empty creates an array of the desired shape without initializing its entries: the content is whatever happens to be in memory, not meaningful random numbers

Arrays can be generated by a specific module

numpy.random allows the efficient random sampling of data from reference distributions (e.g. uniform, normal, binomial, etc.)

import numpy.random as rand
# Generate an array of shape (2,3) 
# populated with random floats from the half-open interval [0,1)
# [uniform distribution]
rand.random_sample((2,3))

# Generate array of shape (4,2) 
# populated with random integers from the half-open interval [5,11)
# [discrete uniform distribution]
rand.randint(low=5,high=11,size=(4,2))

# Draw a sample from a normal distribution with mean=0 and sd=1
rand.normal(loc=0, scale=1, size=100)

Accessing array elements

Indexing / Slicing / Iterating

One-dimensional arrays can be indexed, sliced and iterated over like Python lists

a = np.arange(5) # [0,1,2,3,4]

b = np.array([i**2 for i in a]) # [0,1,4,9,16]

Multi-dimensional arrays can have one index or slice per axis. These indices are given in a tuple separated by commas

a = np.fromfunction(lambda x,y: 10*x+y,(4,4),dtype=int) 
# [[0,1,2,3],[10,11,12,13],[20,21,22,23],[30,31,32,33]]

# First row, second element
a[0,1] # 1

# Last two rows, all elements
a[-2:,:] # [[20,21,22,23],[30,31,32,33]]

When fewer indices are provided than the number of axes, the missing indices are considered complete slices

# Third row
a[2] # interpreted as a[2,:]
# [20,21,22,23]

NumPy arrays can also be indexed using arrays of integers or arrays of booleans

a = np.random.randint(10,100,10) # [82,19,13,73,63,44,43,53,62,15]

# Index using array of integers
idx = np.array([0,3,7,2,2])
a[idx] # [82,73,53,13,13]

# Index using array of booleans -> must have the same shape as a!
bool_idx = np.array([i%2==0 for i in a])
a[bool_idx] # [82,44,62]

The ellipsis (…) represents as many colons as needed to produce a complete indexing tuple

b = np.array([[[0,1,2],[1,2,3]], [[2,3,4],[3,4,5]]])

b[1, ... ,0] # interpreted as b[1,:,0]
# [2,3]

Iterating over multidimensional arrays is done with respect to the first axis

a = np.fromfunction(lambda x,y: 10*x+y,(4,4),dtype=int)
# [[0,1,2,3],[10,11,12,13],[20,21,22,23],[30,31,32,33]]

for i in a:
    print(i)

# [0,1,2,3]
# [10,11,12,13]
# [20,21,22,23]
# [30,31,32,33]

However, it is possible to iterate over all elements of an array using the flat iterator

a = np.arange(5) # [0,1,2,3,4]

for i in a.flat:
    print(i)
    
# 0
# 1
# 2
# 3
# 4

Array mutability

numpy.ndarray objects are mutable

a = np.arange(5) # [0,1,2,3,4]

a[1] = 10 # [0,10,2,3,4]
a[3] += 2 # [0,10,2,5,4]

NumPy functions, as well as operations like indexing and slicing, return views (shallow copies) whenever possible. Mutating a view also affects the parent array.

a = np.arange(5) # [0,1,2,3,4]

b = a[:2] # [0,1]
b[1] = 7 # [0,7]

print(a) # [0,7,2,3,4]

In fact both arrays point to the same chunk of memory

np.shares_memory(a, b) # True

To break the link to the parent array we can use the copy method

a = np.arange(5) # [0,1,2,3,4]

b = a # only a reference is assigned, no copy
id(b) == id(a) # True

b = a[:2] # [0,1]
id(b) == id(a) # False
b.base is a # True

b = a[2:].copy() # [2,3,4]
id(b) == id(a) # False
b.base is a # False (a copy owns its data)

b[0] = 5 # [5,3,4]
print(a) # [0,1,2,3,4]

Adding / Removing elements from an array

Adding one element at a time to an array is not advised.

Instead, grow a collection and convert to an array afterwards.

# Suppose we have the following list of strings
a = ['banana', 'ananas', 'orange', 'apple', 'kiwi']

# We are interested in knowing the average length of fruit names that start with 'a'

# Since we don't know the number of such fruits a priori, first grow a list
b = [len(fruit) for fruit in a if fruit.startswith('a')]

# Then convert the resulting list to array
b = np.array(b)

# Get the mean of the array
b.mean() # 5.5

Alternatively, if the number of elements is known beforehand, initialize an array with the desired size and shape and progressively fill it.

# Suppose we have the same list
a = ['banana', 'ananas', 'orange', 'apple', 'kiwi']

# We are now interested in knowing the average length of fruit names

# We could initialize an array of the desired length
l = len(a)
b = np.zeros(l) # [0,0,0,0,0]

# Iterate over a and populate b
for i in range(l):
    b[i] = len(a[i])
    
# [6,6,6,5,4]

b.mean() # 5.4

Arrays of compatible shapes can be stacked together along different axes

  • vstack stacks a sequence of arrays along the first axis
a = np.fromfunction(lambda x,y: 10*x+y, (2,2)) # [[0,1],[10,11]]
b = np.fromfunction(lambda x,y: x+2*y, (2,2)) # [[0,2],[1,3]]

np.vstack((a,b)) # [[0,1],[10,11],[0,2],[1,3]]
  • hstack stacks a sequence of arrays along the second axis [except for 1-D arrays!]
np.hstack((a,b)) # [[0,1,0,2],[10,11,1,3]]

# Except 1D array!!!
np.hstack((a[0,:], b[0,:])) # [0,1,0,2]
  • column_stack stacks 1-D arrays as columns into a 2-D array
np.column_stack((a[0,:], b[0,:])) # [[0,0],[1,2]]
  • stack stacks a sequence of arrays along a new axis
  • concatenate stacks a sequence of arrays along an existing axis
  • block assembles an ndarray from nested lists of blocks
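The last three routines have no example above; a minimal sketch of their behavior:

```python
import numpy as np

a = np.zeros((2, 2))
b = np.ones((2, 2))

# stack joins arrays along a NEW axis: two (2,2) arrays become one (2,2,2)
np.stack((a, b)).shape                # (2, 2, 2)

# concatenate joins arrays along an EXISTING axis
np.concatenate((a, b), axis=0).shape  # (4, 2), same as vstack
np.concatenate((a, b), axis=1).shape  # (2, 4), same as hstack

# block assembles an array from nested lists of blocks
np.block([[a, b], [b, a]]).shape      # (4, 4)
```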

Shaping an array

The array attribute array.shape returns a tuple with the size of the array along each axis

The array method array.reshape changes the shape of an array to an arbitrary one, provided that the new shape is compatible with array.size

a = np.arange(10) # [0,1,2,3,4,5,6,7,8,9]

a.reshape((2,5)) # [[0,1,2,3,4],[5,6,7,8,9]]

a.reshape((5,2)) # [[0,1],[2,3],[4,5],[6,7],[8,9]]

a.reshape((5,3)) # ValueError: cannot reshape array of size 10 into shape (5,3)

The numpy object numpy.newaxis can be used to add a new axis to an array

a = np.arange(5) # [0,1,2,3,4]

# Make a row vector of a 
a[np.newaxis,:] # [[0,1,2,3,4]]

# Make a column vector of a
a[:,np.newaxis] # [[0],[1],[2],[3],[4]]

The numpy function np.expand_dims() inserts a new axis at the specified position.

a = np.arange(3) # [0,1,2]

# Make a row vector of a
np.expand_dims(a, axis=0) # [[0,1,2]]

# Make a column vector of a
np.expand_dims(a, axis=1) # [[0],[1],[2]]

Array Methods

An ndarray object has many methods which operate on or with the array in some fashion, typically returning an array result

Array conversion

  • ndarray.item(args) Copy an element of an array to a standard Python scalar and return it
  • ndarray.tolist() Return the array as a nested list of Python scalars
  • ndarray.itemset(args) Insert scalar into an array
  • ndarray.tobytes([order]) Construct Python bytes containing the raw data bytes in the array
  • ndarray.tofile(fid[, sep, format]) Write array to a file as text or binary
  • ndarray.dump(file) Dump a pickle of the array to the specified file
  • ndarray.dumps() Returns the pickle of the array as a string
  • ndarray.astype(dtype[, order, casting, …]) Copy of the array, cast to a specified type
  • ndarray.byteswap([inplace]) Swap the bytes of the array elements
  • ndarray.copy([order]) Return a copy of the array
  • ndarray.view([dtype][, type]) New view of array with the same data
  • ndarray.getfield(dtype[, offset]) Returns a field of the given array as a certain type
  • ndarray.setflags([write, align, uic]) Set array flags
  • ndarray.fill(value) Fill the array with a scalar value

Shape manipulation

  • ndarray.reshape(shape[, order]) Returns an array containing the same data with a new shape
  • ndarray.resize(new_shape[, refcheck]) Change shape and size of array in-place
  • ndarray.transpose(axes) Returns a view of the array with axes transposed
  • ndarray.swapaxes(axis1, axis2) Return a view of the array with axis1 and axis2 interchanged
  • ndarray.flatten([order]) Return a copy of the array collapsed into one dimension
  • ndarray.ravel([order]) Return a flattened array
  • ndarray.squeeze([axis]) Remove single-dimensional entries from the shape of a

Item selection and manipulation

  • ndarray.take(indices[, axis, out, mode]) Return an array formed from the elements of a at the given indices
  • ndarray.put(indices, values[, mode]) Set a.flat[n] = values[n] for all n in indices
  • ndarray.repeat(repeats[, axis]) Repeat elements of an array
  • ndarray.choose(choices[, out, mode]) Use an index array to construct a new array from a set of choices
  • ndarray.sort([axis, kind, order]) Sort an array in-place
  • ndarray.argsort([axis, kind, order]) Returns the indices that would sort this array
  • ndarray.searchsorted(v[, side, sorter]) Find indices where elements of v should be inserted in a to maintain order
  • ndarray.nonzero() Return the indices of the elements that are non-zero
  • ndarray.compress(condition[, axis, out]) Return selected slices of a along given axis
  • ndarray.diagonal([offset, axis1, axis2]) Return specified diagonals

Calculation

Many of these methods take an argument named axis.

If axis is None (the default), the array is treated as a 1-D array and the operation is performed over the entire array

If axis is an integer, then the operation is done over the given axis (for each 1-D subarray that can be created along the given axis).

  • ndarray.max([axis, out, keepdims, initial, …]) Return the maximum along a given axis
  • ndarray.argmax([axis, out]) Return indices of the maximum values along the given axis
  • ndarray.min([axis, out, keepdims, initial, …]) Return the minimum along a given axis
  • ndarray.argmin([axis, out]) Return indices of the minimum values along the given axis of a
  • ndarray.ptp([axis, out, keepdims]) Peak to peak (maximum - minimum) value along a given axis
  • ndarray.clip([min, max, out]) Return an array whose values are limited to [min, max]
  • ndarray.round([decimals, out]) Return a with each element rounded to the given number of decimals
  • ndarray.trace([offset, axis1, axis2, dtype, out]) Return the sum along diagonals of the array
  • ndarray.sum([axis, dtype, out, keepdims, …]) Return the sum of the array elements over the given axis
  • ndarray.cumsum([axis, dtype, out]) Return the cumulative sum of the elements along the given axis
  • ndarray.mean([axis, dtype, out, keepdims]) Returns the average of the array elements along given axis
  • ndarray.var([axis, dtype, out, ddof, keepdims]) Returns the variance of the array elements, along given axis
  • ndarray.std([axis, dtype, out, ddof, keepdims]) Returns the standard deviation of the array elements along given axis
  • ndarray.prod([axis, dtype, out, keepdims, …]) Return the product of the array elements over the given axis
  • ndarray.cumprod([axis, dtype, out]) Return the cumulative product of the elements along the given axis
  • ndarray.all([axis, out, keepdims]) Returns True if all elements evaluate to True
  • ndarray.any([axis, out, keepdims]) Returns True if any of the elements of a evaluate to True
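The effect of the axis argument can be sketched with sum:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

a.sum()        # 21, axis=None: operate over the entire array
a.sum(axis=0)  # [5, 7, 9], collapse rows: one result per column
a.sum(axis=1)  # [6, 15], collapse columns: one result per row
```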

Arithmetic and comparison operations

Arithmetic and comparison operations on ndarrays are defined as element-wise operations, and generally return ndarray objects

Comparison

  • ndarray.__lt__(self, value, /) Return self<value
  • ndarray.__le__(self, value, /) Return self<=value
  • ndarray.__gt__(self, value, /) Return self>value
  • ndarray.__ge__(self, value, /) Return self>=value
  • ndarray.__eq__(self, value, /) Return self==value
  • ndarray.__ne__(self, value, /) Return self!=value

Unary

  • ndarray.__neg__(self, /) -self
  • ndarray.__pos__(self, /) +self
  • ndarray.__abs__(self) Return abs(self), equivalent to np.abs(self)
  • ndarray.__invert__(self, /) ~self

Arithmetic (returning new array)

  • ndarray.__add__(self, value, /) Return self+value
  • ndarray.__sub__(self, value, /) Return self-value
  • ndarray.__mul__(self, value, /) Return self*value
  • ndarray.__truediv__(self, value, /) Return self/value
  • ndarray.__floordiv__(self, value, /) Return self//value
  • ndarray.__mod__(self, value, /) Return self%value
  • ndarray.__divmod__(self, value, /) Return divmod(self, value)
  • ndarray.__pow__(self, value[, mod]) Return pow(self, value, mod)
  • ndarray.__and__(self, value, /) Return self&value
  • ndarray.__or__(self, value, /) Return self|value
  • ndarray.__xor__(self, value, /) Return self^value
  • ndarray.__matmul__(self, value, /) Return self@value

Arithmetic (inplace)

  • ndarray.__iadd__(self, value, /) Return self+=value
  • ndarray.__isub__(self, value, /) Return self-=value
  • ndarray.__imul__(self, value, /) Return self*=value
  • ndarray.__itruediv__(self, value, /) Return self/=value
  • ndarray.__ifloordiv__(self, value, /) Return self//=value
  • ndarray.__imod__(self, value, /) Return self%=value
  • ndarray.__ipow__(self, value, /) Return self**=value
  • ndarray.__iand__(self, value, /) Return self&=value
  • ndarray.__ior__(self, value, /) Return self|=value
  • ndarray.__ixor__(self, value, /) Return self^=value.
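All of the operators above act element-wise; a brief sketch:

```python
import numpy as np

a = np.arange(4)          # [0, 1, 2, 3]

print(a + 10)             # [10, 11, 12, 13]
print(a * 2)              # [0, 2, 4, 6]
print(a < 2)              # [ True,  True, False, False]

b = np.array([[1, 2], [3, 4]])
print(b @ b)              # matrix product: [[ 7, 10], [15, 22]]

a += 1                    # in-place: a is now [1, 2, 3, 4]
```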

Numpy ufunc

NumPy provides familiar mathematical functions such as sin, cos, and exp. In NumPy, these are called ufunc (universal functions).

These functions operate elementwise on an array, producing an array as output

a = np.arange(5) # [0,1,2,3,4]

np.power(a, 2) # [0,1,4,9,16]

Vectorization

Vectorization describes the absence of any explicit looping in the code.

Looping and indexing take place “behind the scenes” in optimized, pre-compiled C code.

Vectorized code has many advantages:

  • vectorized code is more concise and easier to read
  • fewer lines of code generally means fewer bugs
  • the code more closely resembles standard mathematical notation
  • vectorization results in more “Pythonic” code; without vectorization, many for loops would be needed
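A sketch contrasting an explicit loop with the equivalent vectorized expression (same result, no Python-level loop):

```python
import numpy as np

x = np.arange(100_000, dtype=np.float64)

# Explicit loop: interpreted Python, one iteration per element
looped = np.empty_like(x)
for i in range(x.size):
    looped[i] = x[i] ** 2 + 1

# Vectorized: looping happens in pre-compiled C code
vectorized = x ** 2 + 1

print(np.array_equal(looped, vectorized))  # True
```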

Array Broadcasting

Broadcasting is the term used to describe how operations behave element-by-element when the input arrays have different shapes

Broadcasting rules

  • First rule of broadcasting: if the input arrays do not all have the same number of dimensions, a “1” will be repeatedly prepended to the shapes of the smaller arrays until all the arrays have the same number of dimensions.
A is a 2D array with shape (5,2)
B is a 1D array with shape (5,)
C is a 1D array with shape (2,)

Are A and B compatible for broadcasting?

A and B cannot be broadcast together:
A: 5 x 2
B:     5

Are A and C compatible for broadcasting?

A and C can be broadcast together:
A: 5 x 2
C:(1)x 2
  • Second rule of broadcasting: arrays with a size of 1 along a particular dimension act as if they had the size of the array with the largest shape along that dimension. The value of the array element is assumed to be the same along that dimension for the “broadcast” array.

These rules should ensure that array shapes match
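The shapes from the example above can be checked directly (sketch):

```python
import numpy as np

A = np.ones((5, 2))
B = np.ones(5)
C = np.ones(2)

# A (5,2) with C (2,): C's shape is promoted to (1,2), then the
# size-1 axis is stretched to 5, so the result has shape (5,2)
print((A + C).shape)  # (5, 2)

# A (5,2) with B (5,): B becomes (1,5), and 5 != 2 in the last
# axis, so broadcasting fails
try:
    A + B
except ValueError as e:
    print('cannot broadcast:', e)
```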

Useful Routines

  • Array Sorting
  • Unique Values of Array

Array Sorting

a = np.random.rand(10)

# Sort a in place
a.sort()

###

a = np.random.rand(10)

# Create a sorted copy of a
b = np.sort(a)

b.flags['OWNDATA'] # True
(b==a).all() # False

# Return the indices that would sort a
idx = np.argsort(a)

(a[idx]==b).all() # True

Unique Values of Array

a = np.random.randint(1,5,10) # [1,3,1,3,4,3,4,4,3,3]

# Return (sorted) unique values of a
u_a = np.unique(a) # [1,3,4]

# Also return indices of the first occurrence for each value
u_a, u_idx = np.unique(a, return_index=True) # [1,3,4],[0,1,4]
a[u_idx] == u_a # [True, True, True]

# Also return the number of occurrences for each value
u_a, u_n = np.unique(a, return_counts=True) # [1,3,4],[2,5,3]
u_n.sum() == a.size # True

Exercise(s)

Array creation and indexing

Form the 2-D array (without typing it in explicitly):

[[1,  6, 11],
 [2,  7, 12],
 [3,  8, 13],
 [4,  9, 14],
 [5, 10, 15]]

and generate a new array containing its 2nd and 4th rows.

Solution

import numpy as np

a = np.arange(1,16)

a = a.reshape((5,3), order='F')

b = a[[1,3],:]

Arithmetics

Divide each column of the array:

import numpy as np

a = np.arange(25).reshape(5, 5)

elementwise with the array b = np.array([1., 5, 10, 15, 20]).

Hint: np.newaxis

Solution

import numpy as np

a = np.arange(25).reshape(5,5)
b = np.array([1,5,10,15,20])

a / b[:,np.newaxis]

Array Sorting

Generate a 10 x 3 array of random numbers in range [0,1). For each row, pick the number closest to 0.5.

Hints

  • Use abs and argsort to find the column j closest to 0.5 for each row.
  • Use fancy indexing: a[i,j] – the array i must contain the row numbers corresponding to stuff in j…

Solution

import numpy as np

a = np.random.rand(10,3)

a_distance = np.abs(a - 0.5)

j = np.argsort(a_distance, axis=1)[:,0] # column of the smallest distance in each row

i = np.arange(j.size)

a[i,j]

Matplotlib

What is Matplotlib

Reference Manual

Matplotlib is the most popular data visualization library in Python

Main objects

  • Figure is the top level container. Can be thought of as a canvas or a white paper over which everything is drawn. A figure can hold multiple axes
  • Axes are delimited regions within a figure where all plotting elements are rendered. An axes can host multiple artists
  • Artists are objects that draw elements within axes
  • Axis objects are the reference coordinate systems for plotting within axes

Plotting Philosophies

Functional Approach

Create plots by simply calling routines from the pyplot module. Generally quicker

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(1,10,20)
y = x**2

plt.plot(x,y, 'blue')

Object Oriented Approach

Explicitly creates a Figure object and relies on its methods. More robust and grants greater control over features.

import numpy as np
import matplotlib.pyplot as plt

f = plt.figure()

ax = f.add_axes([0.1, 0.2, 0.8, 0.9])

ax.plot(x,y,'red')

Multiple plots

Multiple calls to a plotting function add artists to the current axes

plt.plot(x,y, 'blue')
plt.plot(-x, y, 'green')

It is possible to create multiple axes within the same figure.

The function plt.subplots() takes as input two integers (nrows and ncols) and returns a figure with the specified number of axes initialized

fig, axes = plt.subplots(2,2)

Axes generated by the subplots function are stored in a numpy.array and can be iterated over to add content

fig, axes = plt.subplots(2,2) 

x = np.linspace(1,10,20)
c=0

for ax in axes.flat:
    c+=1
    ax.plot(x, x**c)

More complex layouts can be achieved by specifying grid layout with gridspec

f = plt.figure(constrained_layout=True) # Instantiate Figure

gs = f.add_gridspec( # Instantiate grid
    nrows=3, ncols=3, # number of rows/columns in grid
    left= 0.2, right=0.8 , top = 0.9, bottom = 0.1, # figure spanning of grid
    wspace=0.05, hspace=0.03, # spacing between grid elements
    width_ratios = [0.6,0.3, 0.1], height_ratios = [0.2,0.3,0.5], # relative width/height of grid elements
    )

ax1 = f.add_subplot(gs[0,:]) # instantiate axes spanning first row of grid
ax2 = f.add_subplot(gs[1,0]) # second axes in pos (1,0)
ax3 = f.add_subplot(gs[1,1]) # third axes in (1,1)
ax4 = f.add_subplot(gs[1:,-1]) # fourth spanning from second row on, last column
ax5 = f.add_subplot(gs[2,:2]) # last axes covers third row, first two columns

Plotting Routines

pyplot offers multiple routines to plot data

  • axline/axhline/axvline/hlines/vlines adding lines to a plot
  • bar/barh plots data as a bar plot either vertically or horizontally
  • boxplot box and whiskers distribution
  • hist/hist2d to display data as histograms
  • pie pieplot of data
  • plot xy plot as line
  • scatter scatterplot of data as markers
  • violinplot density distribution of data

Plot

  • It is the most versatile plotting routine and yields scatter plots in which points are connected by a line by default

  • It is very useful in displaying trending or temporal data

  • Accepts as little as one parameter, which is interpreted as a y vector

  • Allows extensive customization of the output

p = [0.1,0.2,0.3,0.4]

fig , (ax1,ax2) = plt.subplots(1,2)

ax1.plot(p) # p is treated as y, with x defaulting to range(len(p))

ax2.plot(x,y, ':ro') # format string specifies 'dotted line, red color, round marker'
ax2.plot(x,0.5*y, linestyle='solid', color='green', marker='x') # explicit formatting

There’s a convenient way for plotting objects with labelled data (i.e. data that can be accessed by index obj['y']). Instead of giving the data in x and y, you can provide the object in the data parameter and just give the labels for x and y

fig , ax = plt.subplots(1,1)

d = {
    'independent_variable':x,
    'response_variable':y
}

ax.plot('independent_variable', 'response_variable', data = d)

It is also possible to create a plot using categorical variables. Matplotlib allows you to pass categorical variables directly to many plotting functions

p = np.random.rand(4)
l = ['A','B','C','D']

fig , ax = plt.subplots(1,1)

ax.plot(l,p, 'o')

Adding Lines to a plot

  • axline draws an infinite line through xy1 and xy2, or through xy1 with a given slope
  • axhline/axvline draw horizontal/vertical lines spanning all axes length
  • hlines/vlines add a set of lines with specified span
fig , ax = plt.subplots(1,1)

ax.scatter(x,y)

ax.axline((1,2), slope=1, color='red')
ax.axvline(x=2, color='green') # vertical line at x=2
ax.axhline(y=60, color='blue') # same as ax.axline((0,60), slope=0)
ax.vlines(x, 0, y, color = 'gray')

BarPlot

A barplot is a graph that presents categorical data with rectangular bars whose heights or lengths are proportional to the values they represent.

fig , ax = plt.subplots(1,1)

ax.bar(
    x = ['A','B','C','D'], # bar positions along x axis
    height = p, # bar heights
    width = 0.8, # width(s) of bar(s)
    bottom = None, # The vertical baseline (default 0 if None)
    align = 'center' # bar alignment relative to xposition
)

BoxPlot

fig , ax = plt.subplots(1,1)

a = np.random.normal(0,1,100)
b = np.random.normal(2,1,100)

ax.boxplot(
    x= [a,b], 
    sym = '', # flier (outlier) marker; '' suppresses them
    vert = True, # vertical or horizontal box
    whis = (5,95), # whisker range
    labels = ['A','B'], # Data Labels
    boxprops = dict(linewidth=3), #  dict of styling options for boxes
    # whiskerprops: a dict of styling options for whiskers,
    # medianprops: a dict of styling options for the median,
    # ...
)

Histograms

fig , ax = plt.subplots(1,1)

ax.hist(
    (a,b),
    label = ['A','B'],
    color = ['red','blue'],
    bins = 25, # number of bins; could also supply a sequence of bin boundaries (allows unequal spacing)
    range = (-5,5), # define the range for binning
    stacked = False, # if True stack series of data
    cumulative = False, # if True each bin gives the counts in that bin plus all bins for smaller values
    # orientation: either 'vertical' or 'horizontal'
)

PiePlot

p = [0.1,0.2,0.3,0.4]

fig , ax = plt.subplots(1,1)

ax.pie(
    p,
    explode=[0.2,0.2,0.2,0.2],
    labels=['A','B','C','D'],
    colors=None, # Allows to specify colors
    autopct=lambda x: f'{x:.3}', # formatting slice data
    pctdistance=0.6, # distance from center to data annotation
    shadow=True,
    labeldistance=1.1, # distance from center to label
    startangle=0, # start plotting
    radius=1, # pie radius
    counterclock=True, # ordering of slices
    wedgeprops=None, # styling slices
    textprops=None, # styling annotations
    center=(0, 0), # locating the pie
    frame=False,
    rotatelabels=False, # adapt label angle to slice 
)

Scatter

Offers a convenient way to visualize how two numeric values are related in data

The main difference with the plot method is that scatter allows specifying size and color parameters for each point

fig , ax = plt.subplots(1,1)

ax.scatter(x,y, 
           s=10*x, # size specification proportional to x
           c=-y,# color specification inversely proportional to y
           cmap='Reds' # specify colormap
          )

ViolinPlot

  • A violin plot is similar to a box and whisker plot.
  • It shows the distribution of quantitative data across several levels of one (or more) categorical variables such that those distributions can be compared.
  • Unlike a box plot, in which all of the plot components correspond to actual datapoints, the violin plot features a kernel density estimation of the underlying distribution.
fig , ax = plt.subplots(1,1)

ax.violinplot(
    [a,b],
    positions=None, # control positioning along x-axis
    vert=True, # vertical or horizontal plot
    widths=0.5, # max width of violin
    showmeans=False, # show the mean value of distributions
    showextrema=True, # show outliers
    showmedians=False, # show median of distribution
    quantiles= None, # show quantiles
    points=100, # n of points for density estimation
    bw_method=None # method for kernel bandwidth estimation
)

Artists

Artist is a class provided by matplotlib for objects that render into a figure

All visible elements in a Figure are subclasses of Artist

Line2d

class matplotlib.lines.Line2D(xdata, ydata, linewidth=None, linestyle=None, color=None, marker=None, markersize=None, markeredgewidth=None, markeredgecolor=None, markerfacecolor=None, markerfacecoloralt='none', fillstyle=None, antialiased=None, dash_capstyle=None, solid_capstyle=None, dash_joinstyle=None, solid_joinstyle=None, pickradius=5, drawstyle=None, markevery=None, **kwargs)

This is the main artist used by the plot function

Consequently, all the above parameters (and more) can be passed to the plot function to set the appearance of line and markers
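Since plot creates Line2D artists, the Line2D keyword arguments pass straight through; a sketch (using a non-interactive backend so it runs anywhere):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 1, 10)

fig, ax = plt.subplots()
line, = ax.plot(x, x**2,
                linewidth=2, linestyle='--',
                marker='o', markersize=4,
                markerfacecolor='white', color='purple')

# The returned object is the Line2D artist itself
print(type(line).__name__)   # Line2D
print(line.get_linewidth())  # 2
```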

Patches

Patches is a submodule of matplotlib containing numerous artists to draw

  • Arc(xy, width, height[, angle, theta1, theta2])
  • Arrow(x, y, dx, dy[, width])
  • Circle(xy[, radius])
  • CirclePolygon(xy[, radius, resolution])
  • ConnectionPatch(xyA, xyB, coordsA[, …])
  • Ellipse(xy, width, height[, angle])
  • Patch([edgecolor, facecolor, color, …])
  • PathPatch(path, **kwargs)
  • Polygon(xy[, closed])
  • Rectangle(xy, width, height[, angle])
  • RegularPolygon(xy, numVertices[, radius, …])
  • Shadow(patch, ox, oy[, props])
  • Wedge(center, r, theta1, theta2[, width])
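Patches are added to an axes with ax.add_patch; a minimal sketch with two of the classes above:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle, Circle

fig, ax = plt.subplots()

# Rectangle takes the lower-left corner plus width and height
ax.add_patch(Rectangle((0.1, 0.1), width=0.3, height=0.2,
                       facecolor='gold', edgecolor='black'))
# Circle takes the center plus a radius
ax.add_patch(Circle((0.7, 0.5), radius=0.15, facecolor='lightseagreen'))

print(len(ax.patches))  # 2
```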

Decorate Plots

Titles and subtitles

  • You can set a figure title with figure.suptitle(str) or plt.suptitle(str)
  • An axes title can be set with axes.set_title(str) or plt.title(str)
fig,axes = plt.subplots(2,1)

fig.suptitle('This is figure title')
axes[0].set_title('this is axes title')

Axis ticks, labels and names

fig, ax = plt.subplots(1,1)

ax.set_xlabel('This is x axis') # equivalent to plt.xlabel()

ax.set_xticks([0,0.5,1])
ax.set_xticklabels(['A','B','C'])

ax.set_xticks([0.25,0.75], minor=True) # equivalent to plt.xticks(loc)
ax.set_xticklabels(['ab','bc'], minor=True) # equivalent to plt.xticks(loc, labels)

 # to remove axis ticks simply pass [] to methods

NB most of the .set_* methods also have a .get_* counterpart

Colorbar

To add a colorbar to a plot, it is required to provide a mappable, like the output of the imshow or scatter methods

fig , ax = plt.subplots()

l = ax.scatter(x,y, c=-y, cmap='Reds') # Returns a matplotlib.collections.PathCollection object which is mappable

fig.colorbar(l)

Legends

Similar to colorbars, legends can also be added to either figures or axes

The simplest call to create a legend is ax.legend(), provided that the axes already contains labeled artists

fig , (ax1,ax2) = plt.subplots(1, 2, figsize=(10,5))

ax1.scatter(x,y, label='x**2')
ax1.scatter(-x,y)
ax1.legend() # equivalent to plt.legend(), automatically detects handles and labels

h = ax2.scatter(x,y)
h1 = ax2.scatter(-x,y)
ax2.legend(handles=[h,h1], labels=['x**2', '-x**2']) # specify handles and labels

Text Annotations

Text annotations can also be added to plots by means of the plt.annotate function or the ax.annotate method. It takes as input a string and a tuple of coordinates indicating where to put the text.

fig , ax = plt.subplots(1,1)

ax.scatter(x,y)
ax.scatter(2,20, c='red')

ax.annotate('Here is (2,20)', (2,20))
# Annotations can then be enriched with multiple feature (box, connectors...)
ax.annotate('There is (2,20)', (2,20), xytext=(4,80), arrowprops={'arrowstyle':'simple'})

Colormaps

A colormap is essentially a function F(x) -> c where c is a color from a set C of colors available to the colormap

The idea behind choosing a good colormap is to find a good representation in 3D colorspace for your data set

Color can be represented in 3D space in various ways. One way to represent color is using CIELAB. In CIELAB, color space is represented by lightness (L*), red-green (a*) and yellow-blue (b*)

Sequential maps

Change in lightness and often saturation of color incrementally, often using a single hue; should be used for representing information that has ordering

Builtin Colormaps

‘viridis’, ‘plasma’, ‘inferno’, ‘magma’, ‘cividis’

‘Greys’, ‘Purples’, ‘Blues’, ‘Greens’, ‘Oranges’, ‘Reds’, ‘YlOrBr’, ‘YlOrRd’, ‘OrRd’, ‘PuRd’, ‘RdPu’, ‘BuPu’, ‘GnBu’, ‘PuBu’, ‘YlGnBu’, ‘PuBuGn’, ‘BuGn’, ‘YlGn’

‘binary’, ‘gist_yarg’, ‘gist_gray’, ‘gray’, ‘bone’, ‘pink’,‘spring’, ‘summer’, ‘autumn’, ‘winter’, ‘cool’, ‘Wistia’, ‘hot’, ‘afmhot’, ‘gist_heat’, ‘copper’

Diverging maps

Change in lightness and possibly saturation of two different colors that meet in the middle at an unsaturated color; should be used when the information being plotted has a critical middle value, such as topography or when the data deviates around zero

Builtin Colormaps

‘PiYG’, ‘PRGn’, ‘BrBG’, ‘PuOr’, ‘RdGy’, ‘RdBu’, ‘RdYlBu’, ‘RdYlGn’, ‘Spectral’, ‘coolwarm’, ‘bwr’, ‘seismic’

Cyclic Maps

Change in lightness of two different colors that meet in the middle and beginning/end at an unsaturated color; should be used for values that wrap around at the endpoints, such as phase angle, wind direction, or time of day

Builtin Colormaps

‘twilight’, ‘twilight_shifted’, ‘hsv’

Qualitative Maps

Often are miscellaneous colors; should be used to represent information which does not have ordering or relationships.

Builtin Colormaps

‘Pastel1’, ‘Pastel2’, ‘Paired’, ‘Accent’, ‘Dark2’, ‘Set1’, ‘Set2’, ‘Set3’, ‘tab10’, ‘tab20’, ‘tab20b’, ‘tab20c’

Customize Color Palettes

It is possible to create custom colormaps or modify existing ones.

Matplotlib has two main classes for colormaps: ListedColormap and LinearSegmentedColormap

  • ListedColormap objects are essentially ‘arrays’ of shape Nx4, where N is the number of distinct colors available in the colormap and the four columns specify RGBA values (the alpha column may be omitted if A==1 for all colors). Such a colormap is a lookup table, so “oversampling” the colormap returns nearest-neighbor interpolation
  • LinearSegmentedColormap objects consist of a spectrum of colors interpolated between anchor points
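
The lookup-table behavior can be seen by calling a small ListedColormap directly; a minimal sketch (the three colors are illustrative):

```python
from matplotlib.colors import ListedColormap

# A tiny 3-entry lookup table: each row is an RGBA color
cmap = ListedColormap([[1, 0, 0, 1], [0, 1, 0, 1], [0, 0, 1, 1]])

# Calling the colormap with a float in [0, 1] snaps to the nearest entry
print(cmap(0.0))   # first entry (red)
print(cmap(0.4))   # middle entry (green): 0.4 * 3 -> index 1
print(cmap(1.0))   # last entry (blue)
```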

Extract colormap values

The function get_cmap from the submodule matplotlib.cm allows the extraction of builtin colormaps.

For ListedColormap objects the corresponding list of colors is accessible at cmap.colors

from matplotlib.cm import get_cmap

cmap = get_cmap('tab10')

fig,ax = plt.subplots()
for i,c in enumerate(cmap.colors):
    ax.scatter(i,i,c=[c])

LinearSegmentedColormap objects do not have a .colors attribute. However, one may still call the colormap with an integer array, or with a float array between 0 and 1.

cmap = get_cmap('Reds')

fig, ax = plt.subplots()
for i in np.linspace(0,1,10):
    ax.scatter(i,i,c=[cmap(i)])

Define New ColorMaps

Defining a custom ListedColormap is as easy as calling matplotlib.colors.ListedColormap(sequence) with a sequence of valid matplotlib colors

from matplotlib.colors import ListedColormap

cmap = ListedColormap(["darkorange", "gold", "lawngreen", "lightseagreen"])

fig, ax = plt.subplots()
for i in np.arange(4):
    ax.scatter(i,i,c=[cmap(i)], s=100)

Define New ColorMaps

Defining a custom LinearSegmentedColormap is slightly more complex and requires the specification of anchor points for each color channel. Each anchor point is specified as a row in a matrix of the form [x[i] yleft[i] yright[i]], where x[i] is the anchor, and yleft[i] and yright[i] are the values of the color channel on either side of the anchor point.

from matplotlib.colors import LinearSegmentedColormap

cdict = {'red':   [[0.0,  0.0, 0.0], [0.5,  1.0, 1.0], [1.0,  1.0, 1.0]],
         'green': [[0.0,  0.0, 0.0], [0.25, 0.0, 0.0], [0.75, 1.0, 1.0], [1.0,  1.0, 1.0]],
         'blue':  [[0.0,  0.0, 0.0], [0.5,  0.0, 0.0], [1.0,  1.0, 1.0]]}

cmap = LinearSegmentedColormap('customCmap', segmentdata=cdict, N=256)

fig, ax = plt.subplots()
for i in np.linspace(0,1,256):
    ax.scatter(i,i,c=[cmap(i)])

Matplotlib Exercise 1

Generate two random samples and display a beeswarm boxplot/violinplot

Hint: add (small) noise to the x coordinates to separate data points

Solution

a = np.random.normal(0,1,100)
b = np.random.normal(2,1,100)

n_a = 0.4*np.random.random_sample(a.size) - 0.2 # [a,b) interval -> (b - a) * random_sample() + a
n_b = 0.4*np.random.random_sample(b.size) - 0.2


fig , ax = plt.subplots(1,1)

ax.violinplot([a,b])
ax.scatter(1*np.ones_like(a)+n_a, a, s=2, c='gray')
ax.scatter(2*np.ones_like(b)+n_b, b, s=2, c='gray')

Matplotlib Exercise 2

Given the following array, plot a stacked bar plot of its content

x = np.random.rand(5)
x = x/x.sum()

Hint: to obtain a stacked plot, specify both the bottom and height parameters for each bar segment

Solution

from matplotlib.cm import get_cmap
cm = get_cmap('tab10')

col = []
bottom = []

loc = np.ones_like(x)  # all segments share the same x location

b = 0
for i, t in enumerate(x):
    col.append(cm(i))   # one color per segment
    bottom.append(b)    # each segment starts where the previous one ended
    b += t

fig, ax = plt.subplots(1, 1)

ax.bar(loc, height=x, bottom=bottom, color=col)

Pandas

What is Pandas

Reference Manual

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool

Its main classes are Series and DataFrame

Series

Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.)

Axis labels are collectively referred to as the index

The basic method to create a Series is to call

import pandas as pd
s = pd.Series(data, index=index)

where data could be one of

  • an ndarray
  • a Python dict
  • a scalar value

And index is a list of axis labels

NB pandas indexes support non-unique values! Operations that require uniqueness may raise exceptions at runtime

  • Series from ndarray

If data is an ndarray, index must be the same length as data.

If no index is passed, one will be created having values [0, ..., len(data) - 1]

  • Series from dictionaries

If data is a dict and an index is passed, the values in data corresponding to the labels in the index will be pulled out (with NaN in case key is absent)

If no index is passed, dict keys will be used and the series will be ordered either alphanumerically or by dict insertion order (depending on pandas version)

  • Series from scalars

If data is a scalar value, an index must be provided. The value will be repeated to match the length of index
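
A minimal sketch of the three construction modes:

```python
import numpy as np
import pandas as pd

s1 = pd.Series(np.array([10, 20, 30]))              # ndarray -> default index 0..2
s2 = pd.Series({'a': 1, 'b': 2}, index=['b', 'c'])  # dict + index: missing key -> NaN
s3 = pd.Series(5.0, index=['x', 'y', 'z'])          # scalar repeated along the index

print(s1.index.tolist())   # [0, 1, 2]
print(s3.tolist())         # [5.0, 5.0, 5.0]
```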

Pandas Series behave like ndarray

  • they are a collection of items of the same type -> dtype
  • they are valid arguments to most numpy functions

Pandas Series behave like dictionaries

  • Values can be get and set by accessing series through index labels

NB A key difference between Series and ndarray is that operations between Series automatically align the data based on label. Thus, you can write computations without giving consideration to whether the Series involved have the same labels
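
A small illustration of label alignment:

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['y', 'z', 'w'])

out = a + b   # aligned on labels, not positions
print(out)    # 'x' and 'w' appear in only one Series -> NaN there
```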

DataFrames

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. Like Series, DataFrame accepts many different kinds of input:

  • Dict of 1D ndarrays, lists, dicts, or Series
  • 2-D numpy.ndarray
  • Structured or record ndarray
  • A Series

Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching up to the passed index.

From Dictionary of lists / nested dictionaries

>>> d = {
    'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
    'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd']) # same as {'a':1.,'b':2.,'c':3.,'d':4.}
}
>>> df = pd.DataFrame(d)
>>> df
   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

>>> pd.DataFrame(d, index=['d', 'b', 'a'])
   one  two
d  NaN  4.0
b  2.0  2.0
a  1.0  1.0

>>> pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
   two three
d  4.0   NaN
b  2.0   NaN
a  1.0   NaN

From dictionary of array / lists

The 1D-ndarrays/lists must all be the same length.

If an index is passed, it must also be the same length as the arrays.

If no index is passed, the result will be range(n), where n is the array length

>>> d = {
    'one': [1., 2., 3., 4.],
    'two': [4., 3., 2., 1.]
} 
pd.DataFrame(d)
   one  two
0  1.0  4.0
1  2.0  3.0
2  3.0  2.0
3  4.0  1.0

pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0

Adding / Removing columns from DataFrame

You can treat a DataFrame semantically like a dict of like-indexed Series objects. Getting, setting, and deleting columns works with the same syntax as the analogous dict operations

>>> df['one']
a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

>>> df['three'] = df['one'] * df['two']
>>> df['flag'] = df['one'] > 2
>>> df
   one  two  three   flag
a  1.0  1.0    1.0  False
b  2.0  2.0    4.0  False
c  3.0  3.0    9.0   True
d  NaN  4.0    NaN  False

>>> del df['two']
>>> three = df.pop('three')
>>> df
   one   flag
a  1.0  False
b  2.0  False
c  3.0   True
d  NaN  False
  • When inserting a scalar value, it will naturally be propagated to fill the column
  • When inserting a Series that does not have the same index as the DataFrame, it will be conformed to the DataFrame’s index
  • You can insert raw ndarrays but their length must match the length of the DataFrame’s index
  • By default, columns get inserted at the end. To control the insertion point, the insert method is available
  • Inspired by dplyr’s mutate verb, DataFrame has an assign() method that allows you to easily create new columns that are potentially derived from existing columns
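
These rules can be sketched as follows (column names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, 3.0]}, index=['a', 'b', 'c'])

df['flag'] = 7                      # scalar is propagated to fill the column
df.insert(0, 'first', [0, 0, 0])    # control the insertion point
df2 = df.assign(double=lambda d: d['one'] * 2)  # derived column; returns a copy

print(df.columns.tolist())          # ['first', 'one', 'flag']
print(df2['double'].tolist())       # [2.0, 4.0, 6.0]
```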

Indexing and Selection

Operation                  Syntax               Result
Select column              df[col]              Series
Select row by label        df.loc[label]        Series
Select row by location     df.iloc[loc]         Series
Slice rows                 df[start:stop]       DataFrame
Select rows by boolean     df[bool_vec]         DataFrame
Select value by label      df.loc[label, col]   Scalar
Select value by location   df.iloc[ridx, cidx]  Scalar
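
The operations above, in a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0], 'two': [3.0, 4.0]}, index=['a', 'b'])

col = df['one']             # Series
row = df.loc['a']           # Series, selected by label
row2 = df.iloc[1]           # Series, selected by position
val = df.loc['a', 'two']    # scalar
sub = df[df['one'] > 1]     # boolean selection -> DataFrame
```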

Data Alignment

  • Data Alignment when operating with DataFrames occurs along both index and columns
  • When operating on a Series and a DataFrame the Series index is aligned to DataFrame columns and broadcasted ‘rowwise’
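
A small illustration of the row-wise broadcast:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(2, 3), columns=['a', 'b', 'c'])
row = pd.Series([1, 1, 1], index=['a', 'b', 'c'])

out = df - row   # Series index aligned to columns, then broadcast row-wise
print(out)
```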

Merge / Join / Concatenate

  • The concat() function allows concatenation along an axis while performing optional set logic (union or intersection) of the indexes (if any) on the other axes.
pd.concat(objs, axis=0, join='outer', ignore_index=False, keys=None,
          levels=None, names=None, verify_integrity=False, copy=True)
  • The merge() function allows Series and DataFrame joining similar to relational databases (SQL)
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
         left_index=False, right_index=False, sort=True,
         suffixes=('_x', '_y'), copy=True, indicator=False,
         validate=None)
  • The join() method is basically a wrapper around merge() to quickly handle frequent cases
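
A minimal sketch of both functions (column and key names are illustrative):

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b'], 'l': [1, 2]})
right = pd.DataFrame({'key': ['b', 'c'], 'r': [3, 4]})

stacked = pd.concat([left, right], ignore_index=True)   # rows appended, columns unioned
joined = pd.merge(left, right, on='key', how='inner')   # SQL-style inner join on 'key'

print(len(stacked))            # 4
print(joined['key'].tolist())  # ['b']
```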

Reshaping

Categorical Data

A categorical variable takes on a limited, and usually fixed, number of possible values

Categorical data might have an order, but numerical operations (additions, divisions, …) are not possible

Internally, the data structure consists of a categories array and an integer array of codes which point to the real value in the categories array.
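
The internal categories/codes split can be inspected through the .cat accessor:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c'], dtype='category')

print(list(s.cat.categories))  # ['a', 'b', 'c'] - the distinct values
print(list(s.cat.codes))       # [0, 1, 0, 2]    - integer pointers into categories
```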

Creation

  • Series
    • specify dtype="category" when constructing a Series: pd.Series(["a", "b", "c", "a"], dtype="category")
    • convert an existing Series to a category dtype: pd.Series(["a", "b", "c", "a"]).astype('category')
    • use special functions, such as cut(), which groups data into discrete bins:
s = pd.Series(np.random.randint(0, 5, 10))

labels = ["{0} - {1}".format(i, i + 1) for i in range(0, 5, 2)]

pd.cut(s, range(0, 5+2 ,2), right=False, labels=labels)
  • pass a pandas.Categorical object to a Series:
raw_cat = pd.Categorical(["a", "b", "c", "a"],
                         categories=["b", "c", "d"], ordered=False)

s = pd.Series(raw_cat)
  • DataFrames
    • during construction by specifying dtype="category" in the DataFrame constructor
    • all columns in an existing DataFrame can be batch converted using DataFrame.astype()

NB conversions are done column by column!

Working With Categories

Categorical Series have a .cat ‘attribute’ (accessor) that gives access to methods for handling categories

  • Listing categories: s.cat.categories

NB the results of s.cat.categories and s.unique() are not guaranteed to be equal

  • Renaming categories: assign new values to the s.cat.categories property or use the s.cat.rename_categories(sequence) method

NB s.cat.rename_categories() returns a new Series by default; it does not modify s in place unless inplace=True is passed

NB unlike R factors, categories can be of any dtype (but not null/None/NaN)

  • Adding / Removing categories: s.cat.add_categories()/s.cat.remove_categories()

Values corresponding to removed categories are changed to NaN

s.cat.remove_unused_categories() removes all categories that are not used in s

  • Sorting: categorical data may be ordered. If categorical data is ordered (s.cat.ordered == True) s.sort_values() returns the sorted series using the order defined by categories, not any lexical order present on the data type.
  • Reordering: If categorical data is ordered (s.cat.ordered == True) s.cat.reorder_categories(sequence) allows changing the category order.

NB All old categories must be included in the new categories and no new categories are allowed

NB reordering affects sorting order!

  • Comparisons: Comparing categorical data with other objects is possible in three cases
    • Comparing equality (== and !=) to a list-like object (list, Series, array, …) of the same length as the categorical data.
    • All comparisons (==, !=, >, >=, <, and <=) of categorical data to another categorical Series, when ordered==True and the categories are the same.
    • All comparisons of a categorical data to a scalar.
  • Merging: Combining Series or DataFrames which contain the same categories results in category dtype, otherwise results will depend on the dtype of the underlying categories. Use .astype to ensure category results.
  • Slicing:
    • If the slicing operation returns either a DataFrame or a Series, dtype: category is preserved
    • Slicing along one single row does not preserve dtype: category
    • Returning a single item from categorical data will return the value
  • Setting Elements: setting elements of a Series with category dtype is allowed, provided that the assigned values are included in the categories
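
A short sketch of adding and removing categories (values here are illustrative):

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a'], dtype='category')
s = s.cat.add_categories(['d'])       # 'd' becomes available but unused
s = s.cat.remove_categories(['b'])    # values equal to 'b' become NaN

print(list(s.cat.categories))         # ['a', 'd']
print(int(s.isna().sum()))            # 1
```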

Groupby / Apply / Combine

The groupby/apply/combine process involves one or more of the following steps:

  • Splitting the data into groups based on some criteria.
  • Applying a task to each group independently such as
    • Aggregation: compute a summary statistic (or statistics) for each group [Compute group sums or means, Compute group sizes / counts…]
    • Transformation: perform some group-specific computations and return a like-indexed object [Standardize data (zscore) within a group, Filling NAs within groups with a value derived from each group]
    • Filtration: discard some groups, according to a group-wise computation that evaluates True or False [Discard data that belongs to groups with only a few members, Filter out data based on the group sum or mean…]
  • Combining the results into a data structure.

Groupby

pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names. The mapping can be specified in many different ways:

  • A Python function, to be called on each of the axis labels.
  • A list or NumPy array of the same length as the selected axis.
  • A dict or Series, providing a label -> group name mapping.
  • For DataFrame objects, a string indicating a column to be used to group.
  • For DataFrame objects, a string indicating an index level to be used to group.
  • A list of any of the above things.

Grouping a DataFrame is as easy as calling

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})

grouped_df = df.groupby('A') # in truth this is df.groupby(df['A'])

NB No splitting occurs until it’s needed. Calling df.groupby() merely creates a GroupBy object and verifies that you’ve passed a valid mapping

  • By default the group keys are sorted during the groupby operation. You may however pass sort=False for potential speedups
  • groupby will preserve the order in which observations are sorted within each group.
  • By default NA values are excluded from group keys during the groupby operation. However, in case you want to include NA values in group keys, you could pass dropna=False to achieve it.

Groupby Object

The result of a df.groupby() call is a GroupBy object.

  • Information about the grouping is stored in the .groups attribute, a dictionary whose keys are the computed unique groups and whose values are the axis labels belonging to each group.
>>> grouped_df.groups
{'bar': [1, 3, 5], 'foo': [0, 2, 4, 6, 7]}
  • GroupBy objects support iteration like a dictionary
for g,group in grouped_df:
    print(g, group.shape)
    
bar (3,4)
foo (5,4)
  • A single group can be selected using GroupBy.get_group(group)
grouped_df.get_group('bar')

    A   B       C     D
1 bar one    1.30 -0.97
3 bar three -0.67  0.44
5 bar two   -0.22 -1.44
  • A single column for each group can be accessed
for g,group in grouped_df['B']:
    print(g,group.values)
    
bar ['one' 'three' 'two']
foo ['one' 'two' 'two' 'one' 'three']

Aggregation

  • GroupBy objects have methods to perform common apply/aggregate tasks [mean, std, sum, size, count, var, sem, describe, first, last, nth, min, max]
grouped_df.size()

A
bar    3
foo    5
dtype: int64

NB numeric functions like sum() will be performed on all compatible columns

  • The GroupBy.agg() function allows performing more complex aggregations
  • Applying multiple functions: GroupBy.agg() accepts a list of functions (any function that takes a Series and returns a scalar is valid)
>>> grouped_df['C'].agg([np.mean, np.std, lambda x: x.max()-x.min()])

         mean       std  <lambda_0>
A
bar  0.137213  1.039323  1.985601
foo -1.097792  0.814446  1.858923
  • Named Aggregation: column-specific aggregation with control over the output column names is achieved with a special syntax for the .agg() function.
    • The keywords are the output column names
    • The values are tuples whose first element is the column to select and the second element is the aggregation to apply to that column
grouped_df.agg(first_col = ('C', np.min),
               second_col = ('D', 'max'),
               third_col = ('D', lambda x: x.max()))

     first_col  second_col  third_col
A
bar  -0.678068    0.435538   0.435538
foo  -2.108614    2.303290   2.303290

Transformation

The transform method returns an object that has the same size as the one being grouped.

The transform function must:

  • Return a result that is either the same size as the group chunk or broadcastable to the size of the group chunk
  • Operate column wise on the group chunk.
  • Not perform in-place operations on the group chunk. Group chunks should be treated as immutable, and changes to a group chunk may produce unexpected results.
s = pd.Series(np.random.rand(20)) # generate random values
null_s = np.random.choice(20, 5, replace=False) # randomly choose 5 distinct positions in s
s[null_s] = np.nan # insert 5 NaN

# Test for NA in s
pd.isna(s).sum()
5

key = np.random.randint(0,3, s.size) # grouping key

# Set NA values to group mean
transformed_s = s.groupby(key).transform(lambda x: x.fillna(x.mean()))

# Test for NA in transformed_s
pd.isna(transformed_s).sum()
0

Filtration

The filter method returns a subset of the original object

  • The argument of filter must be a function that, applied to the group as a whole, returns True or False
# Keep only groups with more than 5 items
s.groupby(key).filter(lambda x: x.size > 5)
  • Instead of dropping groups, they can be returned filled with NA
# Same filter, but dropped groups are returned filled with NA
s.groupby(key).filter(lambda x: x.size > 5, dropna=False)
  • For DataFrames with multiple columns, filters should explicitly specify a column as the filter criterion
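
A small sketch of a column-based filter (the threshold is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'key': list('aabbb'), 'val': [1, 2, 3, 4, 5]})

# Keep only the groups whose 'val' column sums to more than 5
out = df.groupby('key').filter(lambda g: g['val'].sum() > 5)
print(out['key'].unique().tolist())  # ['b']  (group 'a' sums to 3 and is dropped)
```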

Method Chaining

Method chaining is a programmatic style of invoking multiple method calls sequentially with each call performing an action on the same object and returning it. It eliminates the cognitive burden of naming variables at each intermediate step

An advantage of method chaining is that it is a top-down approach, with arguments placed next to their function, unlike nested calls, where tracking down each function call's arguments is demanding.

Pandas provides many functions suited to method chaining: every function that takes a DataFrame as input and returns a DataFrame is valid for piping

Starting from version 0.16.2 pandas provides the pipe method, which allows piping of user-defined functions.

  • pros READABILITY
  • cons DEBUGGING
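
A minimal sketch of a chain using pipe (add_total is a hypothetical helper, not a pandas function):

```python
import pandas as pd

def add_total(d, cols):
    # hypothetical helper: append a 'total' column summing the given columns
    return d.assign(total=d[cols].sum(axis=1))

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

out = (
    df
    .pipe(add_total, cols=['a', 'b'])   # user-defined function in the chain
    .query('total > 4')
)
print(out['total'].tolist())  # [6]
```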

Pandas Exercises

The file gene_table.txt contains summary annotation on all human genes, based on the Ensembl annotation

For each gene, this file contains:

  • gene_name based on the HGNC nomenclature
  • gene_biotype for example protein_coding, pseudogene, lincRNA, miRNA etc. See here for a more detailed description of the biotypes
  • chromosome on which the gene is located
  • strand on which the gene is located
  • transcript_count the number of known isoforms of the gene
head(gene_table.txt)

gene_name,gene_biotype,chromosome,strand,transcript_count
TSPAN6,protein_coding,chrX,-,5
TNMD,protein_coding,chrX,+,2
DPM1,protein_coding,chr20,-,6
SCYL3,protein_coding,chr1,-,5
C1orf112,protein_coding,chr1,+,9

Tasks

  1. load the dataset
  2. compute the number of genes annotated for the human genome
  3. compute the minimum, maximum, average and median number of known isoforms per gene (consider the transcript_count column as a series).
  4. plot a histogram of the number of known isoforms per gene
  5. compute the number of different biotypes
  6. compute, for each gene_biotype, the number of associated genes, and print the gene_biotype with the number of associated genes in decreasing order
  7. compute, for each chromosome, the percentage of genes located on the + strand
  8. compute, for each biotype, the average number of transcripts associated to genes belonging to the biotype

Solutions

  1. load the dataset
>>> import pandas as pd
>>> df = pd.read_csv('path/to/file')
  2. compute the number of genes annotated for the human genome
>>> df.gene_name.unique().size
51327
  3. compute the minimum, maximum, average and median number of known isoforms per gene
>>> df.transcript_count.agg(['min','max','mean','median'])
min         1.00000
max       170.00000
mean        3.70047
median      1.00000
  4. plot a histogram of the number of known isoforms per gene
df.transcript_count.plot.hist(bins=100, xticks=[])

  5. compute the number of different biotypes
>>> (
     df
     .gene_biotype
     .astype('category')
     .cat
     .categories
     .size
    )
41
  6. compute, for each gene_biotype, the number of associated genes, and print the gene_biotype with the number of associated genes in decreasing order
>>> (
     df
     .groupby('gene_biotype')
     .size()
     .sort_values(ascending=False)
    )
  7. compute, for each chromosome, the percentage of genes located on the + strand
>>> out = df.groupby('chromosome')['strand'].agg([lambda x: round(x.eq('+').sum()/x.size, 2)])
>>> out.head()

            <lambda>
chromosome          
chr1            0.51
chr10           0.51
chr11           0.50
chr12           0.50
chr13           0.49
  8. compute, for each biotype, the average number of transcripts associated to genes belonging to the biotype
>>> df.groupby('gene_biotype')['transcript_count'].mean()

gene_biotype
3prime_overlapping_ncRNA    1.153846
IG_C_gene                   1.285714
IG_C_pseudogene             1.000000
IG_D_gene                   1.000000
IG_J_gene                   1.000000
Name: transcript_count, dtype: float64

Accessory modules

Builtin Functions

  • zip(*iterables) Make an iterator that aggregates elements from each of the iterables
list(zip(['A','B','C'],(0,1,2)))

[('A',0),('B',1),('C',2)]
  • enumerate(iterable, start=0) Return an enumerate object. iterable must be a sequence, an iterator, or some other object which supports iteration. The next() method of the iterator returned by enumerate() returns a tuple containing a count (from start which defaults to 0) and the values obtained from iterating over iterable
for i in enumerate(['A','B','C']):
    print(i)
    
(0,'A')
(1,'B')
(2,'C')

Itertools

This module implements a number of iterator building blocks

  • product(*iterables, repeat=1) Cartesian product of input iterables. Roughly equivalent to nested for-loops in a generator expression
from itertools import product
list(product([0,1], ['A','B']))

[(0, 'A'), (0, 'B'), (1, 'A'), (1, 'B')]
  • combinations(iterable, r) Return r length subsequences of elements from the input iterable
from itertools import combinations
list(combinations(['A','B','C'], r=2))

[('A', 'B'), ('A', 'C'), ('B', 'C')]
  • permutations(iterable, r=None) Return successive r length permutations of elements in the iterable. If r is not specified or is None, then r defaults to the length of the iterable and all possible full-length permutations are generated
from itertools import permutations
list(permutations(['A','B','C']))

[('A', 'B', 'C'), ('A', 'C', 'B'), ('B', 'A', 'C'), ('B', 'C', 'A'), ('C', 'A', 'B'), ('C', 'B', 'A')]

collections

This module implements specialized container datatypes providing alternatives to Python’s general purpose built-in containers

  • Counter([iterable-or-mapping]) A Counter is a dict subclass for counting hashable objects. It is a collection where elements are stored as dictionary keys and their counts are stored as dictionary values
from collections import Counter

c = Counter('banana')
c

Counter({'a': 3, 'n': 2, 'b': 1})

c.most_common(1) # List items by frequency

[('a',3)]

c.get('b') # Get element count

1
  • defaultdict(default_factory) When a key is added for the first time, an entry is automatically created using the default_factory function
from collections import defaultdict

s = [('yellow', 1), ('blue', 2), ('yellow', 3), ('blue', 4), ('red', 1)]
d = defaultdict(list)

for k,v in s:
    d[k].append(v) # the first time a key is encountered an empty list is created, therefore `.append()` method is guaranteed to work

d
defaultdict(<class 'list'>, {'yellow': [1, 3], 'blue': [2, 4], 'red': [1]})
  • OrderedDict Ordered dictionaries are just like regular dictionaries but have some extra capabilities relating to ordering operations. They have become less important now that the built-in dict class gained the ability to remember insertion order (this new behavior became guaranteed in Python 3.7)
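
One such ordering capability, sketched briefly:

```python
from collections import OrderedDict

d = OrderedDict([('a', 1), ('b', 2), ('c', 3)])
d.move_to_end('a')   # reorder: 'a' moves to the end (not available on plain dicts)
print(list(d))       # ['b', 'c', 'a']
```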

Re

This module provides regular expression matching operations similar to those found in Perl

NB Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s usage of the same character for the same purpose in string literals; for example, to match a literal backslash, one might have to write '\\\\' as the pattern string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal.

The solution is to use Python’s raw string notation for regular expression patterns:r"str". So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Usually patterns will be expressed in Python code using this raw string notation
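
A quick check of the difference:

```python
import re

# r"\n" is two characters (backslash + 'n'); "\n" is a single newline character
print(len(r"\n"), len("\n"))  # 2 1

# Matching a literal backslash: the regex is \\ which, as a raw string, is r"\\"
print(re.search(r"\\", "a\\b"))
```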

  • compile(pattern) Compile a regular expression pattern into a regular expression object, which can be used for matching
import re

p = re.compile(r'cat')
  • match(pattern, string, flags=0) If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern
s1 = "O'Malley the alley cat, that's right, and I'm very proud of that"
s2 = "categorical"

for s in [s1,s2]:
    m = p.match(s)
    
    if m is None:
        print(f'No match found for pattern \"{p}\"')
    else:
        print(m)

No match found for pattern "re.compile('cat')"
<re.Match object; span=(0, 3), match='cat'>
  • search(pattern, string) Scan through string looking for the first location where this regular expression produces a match, and return a corresponding match object
s3 = "pandas.Series.cat.categories"

for s in [s1,s2,s3]:
    m=p.search(s)
    
    if m is None:
        print(f'No match found for pattern \"{p}\"')
    else:
        print(m)

<re.Match object; span=(19, 22), match='cat'>
<re.Match object; span=(0, 3), match='cat'>
<re.Match object; span=(14, 17), match='cat'> # ONLY the first
  • findall Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.
for s in [s1,s2,s3]:
    m = p.findall(s)
    
    if m is None:
        print(f'No match found for pattern \"{p}\"')
    else:
        print(m)
        
['cat']
['cat']
['cat','cat']
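
When the pattern contains groups, findall returns the groups rather than the whole match (the pattern here is illustrative):

```python
import re

p = re.compile(r'(\w+)@(\w+)\.com')  # two groups: user and domain
m = p.findall('ann@foo.com bob@bar.com')
print(m)  # one tuple per match: [('ann', 'foo'), ('bob', 'bar')]
```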